Optimize repeated pointer accesses in op/base reduction functions #9717
Conversation
Signed-off-by: George Katevenis <gkatev@ics.forth.gr>
Can one of the admins verify this patch?
ok to test
I'm a little surprised that we left that pointer-dereference assignment in all those loops, but I guess it just got missed. I double checked MPI-4.0 to make sure that the len argument isn't an INOUT, and it doesn't appear to be. MPI-4.0 p235:13-16 says:
@bosilca Can you verify?
@jsquyres the Git commit checker did not run (even if you issue the …)
@gkatev thanks for the patch! Out of curiosity, did you make any performance measurements (with Open MPI compiled for production, e.g. …)?
This might be an aliasing issue: the compiler cannot assume that …
@ggouaillardet I think I am not seeing a difference with O2/O3, but am I doing it correctly like this?
@gkatev I believe you'll need to pass in CFLAGS at configure time for it to have any effect. You can check with …
FWIW, in past projects I have seen these little optimizations make a difference. Compilers should be smart enough, but we can't always rely on that.
Check the assembly on godbolt for 4 different versions of the int32_t sum. Depending on the compiler, the generated code is slightly different. clang seems to assume the pointer to count is restrict and always generates the same optimized code, while gcc takes a very cautious approach. In general, the difference is reloading *count from the cache instead of working on it from a register. Conclusion: to take the compiler out of the equation and always be sure the code is as optimal as possible, either we add restrict or we use the trick proposed here (a temporary variable).
However, I think I would rather have the code with the following pattern:

```c
void ompi_op_base_2buff_sum_int32_t(const void *in, void *out, int *count)
{
    int i;
    const int32_t *a = (const int32_t *) in;
    int32_t *b = (int32_t *) out;
    for (i = *count; i > 0; i--) {
        *(b++) += *(a++);
    }
}
```
@bosilca Just curious: why keep it incrementing through a pointer dereference?
Not sure I understand the question. Are we talking about count or about the other pointers?
@bosilca You're right... my question was non-sensical. Disregard!
I thought the point of @bosilca's comment was to walk backwards (…)
Ahh never mind, the pointers are still walking forward 🤦 |
Remove all ops with 3 buffers, we ended up not using them anywhere in the code. Change the loop order in the base MPI_Op to allow for more optimizations, as discussed in open-mpi#9717. Fixes open-mpi#9717. Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
I created #9719 as an alternative to this. While digging in the code I also noticed we are not using the 3buff MPI_Op so I took the opportunity to remove all support. |
Remove all ops with 3 buffers, we ended up not using them anywhere in the code. Change the loop order in the base MPI_Op to allow for more optimizations, as discussed in open-mpi#9717. Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
FYI, for curiosity's sake I checked with …
@devreal you are absolutely right about pointer aliasing! And that makes the patch necessary to improve performance whenever the compiler has to be conservative to generate correct code.
optimizations, as discussed in open-mpi#9717. Signed-off-by: George Bosilca <bosilca@icl.utk.edu> (cherry picked from commit 1215776)
Hi, this PR changes the repeated pointer accesses in the reduction functions' range checks so that each access is performed only once. (Hope I didn't miss any; are there reduction functions in another file?)
I wouldn't normally expect this change to make a difference, since I'd expect the compiler to optimize the original code on its own, but I have experimentally observed increased performance with it. That might well not hold for a different system or compiler, but the change shouldn't really have any downside.
Allreduce performance (latency) before patch
10 runs: Average 6374.9 us, Stddev 30.2834 us, Min 6327.88 us, Max 6420.74 us
Allreduce performance (latency) after patch
10 runs: Average 5776.3 us, Stddev 41.5984 us, Min 5705.17 us, Max 5828.30 us
Setup
Disclaimer: Tested on v5.0.0rc2
Signed-off-by: George Katevenis gkatev@ics.forth.gr